clusterProfiler

Tim Vigers

March 23, 2023

Functional enrichment

  • Methods for interpreting results of various high-throughput omics studies.
  • Most omics analyses result in a list of genes (or proteins, metabolites, etc. that are associated with a gene) that are differentially expressed.
    • For example, comparing protein levels between two groups.
  • How do we understand a list of differentially expressed genes in biological context?

Functional enrichment terminology

  • Gene set: an unordered collection of functionally related genes (a pathway)1
  • Gene ontology (GO): a formal representation of three aspects of biological knowledge2
    • Molecular function
      • E.g., “catalysis” or “transport”
    • Cellular component
      • Either cellular compartments (e.g. “mitochondrion”) or stable macromolecular complexes (e.g. “ribosome”)

Functional enrichment terminology

  • Gene ontology (continued):
    • Biological processes
      • Larger processes made up of multiple molecular functions.
      • E.g., “signal transduction” or “glucose membrane transport”
      • Not necessarily equivalent to a pathway

GO graph

  • Organized as a directed acyclic graph (DAG)
  • In general, child terms are more specific than parent terms
    • E.g. biosynthetic process is a subtype of metabolic process and a hexose is a subtype of monosaccharide.

GO graph example

KEGG

  • Kyoto Encyclopedia of Genes and Genomes
  • A “manually curated database resource integrating various biological objects categorized into systems, genomic, chemical and health information.”3
  • Sixteen databases in four broad categories:
    1. Systems information (includes the pathway database)
    2. Genomic information
    3. Chemical information
    4. Health information

KEGG Gluconeogenesis Pathway

More categories!

KEGG pathways can be further subdivided into:

  1. Metabolism
  2. Genetic Information Processing
  3. Environmental Information Processing
  4. Cellular Processes
  5. Organismal Systems
  6. Human Diseases
  7. Drug Development

Other gene sets

  • GO, KEGG, and Reactome4 are the most frequently used1
  • Can use any pathway database for these analyses
  • Alternatives include:
    • Disease Ontology (DO)
    • Disease Gene Network (DisGeNET)
    • wikiPathways
    • Molecular Signatures Database (MSigDb)
  • Unfortunately, the pathway database can have a significant impact on results

Over representation analysis (ORA)

  • Takes a list of differentially expressed genes and tests whether genes from various pathways are present in the list more often than expected.
  • Usually a hypergeometric test5

Over representation analysis (ORA)

\[ p = 1 - \frac{\sum_{i=0}^{k-1}{M\choose i}{N-M \choose n-i}}{N \choose n} \]

  • \(N\) is the number of genes in the background distribution (usually all annotated genes)
  • \(n\) is the number of genes of interest
  • \(M\) is the number of genes annotated to the particular gene set (pathway) \(S\)
  • \(k\) is the number of genes of interest that are annotated to \(S\)

ORA example

  • A background of 10,000 genes, of which 260 are categorized as “axon guidance” (\(S\)).
  • We find that 1000 genes are differentially expressed, and 50 of those are categorized as “axon guidance.”
  • Is this statistically significant over representation?

ORA example

phyper(q = 50-1,m = 260,n=10000-260,k=1000,lower.tail = F)
[1] 3.825066e-06
m = matrix(c(50,1000-50,260-50,10000-1000-260+50),nr=2)
colnames(m) = c("DE","Not DE")
rownames(m) = c("In S","Not in S")
kable(m)
DE Not DE
In S 50 210
Not in S 950 8790
fisher.test(m,alternative = "g")$p.value
[1] 3.825066e-06

ORA problems

  • Does not automatically account for the direction of differential expression
    • Can look at down-regulated and up-regulated genes separately.
  • Also does not take effect size into account
    • Will miss small but coordinated changes across lots of genes
      • These are likely more biologically relevant.

Gene set enrichment analysis (GSEA)

  • Basically, the goal is to test whether members of a gene set \(S\) are distributed randomly throughout a gene list \(L\).6
  • An enrichment score (ES) is calculated by:
    1. Walking down the gene list \(L\) (usually ranked by effect size/correlation with the phenotype).
    2. When a gene is in set \(S\), the ES increases, and decreases when not in \(S\).
      • Increment depends on the statistic in \(L\)

Gene set enrichment analysis (GSEA)

  1. The final ES is the maximum deviation from 0, and corresponds to a weighted Kolmogorov–Smirnov-like statistic.6
  2. The statistical significance of the ES is estimated using an empirical phenotype-based permutation test. Shuffling the phenotype preserves gene-gene correlations, and is better than shuffling gene labels.

clusterProfiler

  • Essentially, a set of wrapper functions that simplify functional enrichment analyses.
  • Provides really nice visualizations for pathway analyses.
  • Also includes utility functions that simplify conversion between gene identifiers (e.g. from gene symbol to Entrez ID).
  • Allows for comparison of multiple gene lists.
  • Uses dplyr verbs (mutate(), etc.) for altering figures and results.

Reactome ORA example

library(clusterProfiler)
library(ReactomePA)
data(geneList, package="DOSE")
head(geneList)
    4312     8318    10874    55143    55388      991 
4.572613 4.514594 4.418218 4.144075 3.876258 3.677857 
de <- names(geneList)[abs(geneList) > 1.5]
head(de)
[1] "4312"  "8318"  "10874" "55143" "55388" "991"  

Reactome ORA example

x <- enrichPathway(gene=de)
res = x@result
kable(head(res[,1:6]),row.names = F)
ID Description GeneRatio BgRatio pvalue p.adjust
R-HSA-69620 Cell Cycle Checkpoints 38/330 293/10899 0 0
R-HSA-69618 Mitotic Spindle Checkpoint 22/330 113/10899 0 0
R-HSA-2500257 Resolution of Sister Chromatid Cohesion 23/330 126/10899 0 0
R-HSA-141424 Amplification of signal from the kinetochores 20/330 96/10899 0 0
R-HSA-141444 Amplification of signal from unattached kinetochores via a MAD2 inhibitory signal 20/330 96/10899 0 0
R-HSA-9648025 EML4 and NUDC in mitotic spindle formation 20/330 117/10899 0 0

Reactome ORA barplot

library(enrichplot)
barplot(x, showCategory=10) 

Reactome ORA dotplot

dotplot(x, showCategory=10) 

Reactome ORA gene-concept network

library(DOSE)
# Convert gene ID to Symbol
edox <- setReadable(x, 'org.Hs.eg.db', 'ENTREZID')
# Network plot
cnetplot(edox, foldChange=geneList)

# Circular plot
cnetplot(edox, foldChange=geneList, circular = TRUE)

Reactome ORA gene-concept network

Viewing a specific pathway

viewPathway("E2F mediated regulation of DNA replication", 
            readable = TRUE, 
            foldChange = geneList)

Reactome ORA heatmap

heatplot(edox, foldChange=geneList, showCategory=10)

Reactome ORA treeplot

Treeplots give you a slightly higher level overview by clustering terms based on similarity.

KEGG GSEA

  • All of the previous plots can be used with either ORA or GSEA. I would recommend using GSEA results whenever possible.
  • There are some GSEA-specific plots that can be useful as well.

KEGG GSEA ridgeplot

  • The ridgeplot helps users to interpret up vs. down-regulated pathways.
x <- gseKEGG(geneList)
edox <- setReadable(x, 'org.Hs.eg.db', 'ENTREZID')
edox = pairwise_termsim(edox)

KEGG GSEA ridgeplot

ridgeplot(edox)

KEGG GSEA statistic

gseaplot2(edox, geneSetID = 1:3, pvalue_table = TRUE,
          color = c("#E495A5", "#86B875", "#7DB0DD"), ES_geom = "dot")

Theme comparison

  • The compareCluster function allows you to perform enrichment analysis on multiple gene lists at once.
  • There is a formula interface that allows for somewhat complex comparisons.
df <- data.frame(Entrez=names(geneList), FC=geneList)
df$group <- "upregulated"
df$group[df$FC < 0] <- "downregulated"
df$othergroup <- "A"
df$othergroup[abs(df$FC) > 2] <- "B"

Theme comparison

kable(head(df))
Entrez FC group othergroup
4312 4312 4.572613 upregulated B
8318 8318 4.514594 upregulated B
10874 10874 4.418218 upregulated B
55143 55143 4.144075 upregulated B
55388 55388 3.876258 upregulated B
991 991 3.677857 upregulated B

Theme comparison

formula_res <- compareCluster(Entrez~group+othergroup, data=df, fun="enrichKEGG")
dotplot(formula_res, x="group") + facet_grid(~othergroup)

Theme comparison

Theme comparison

data(gcSample)
str(gcSample) 
List of 8
 $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
 $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
 $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
 $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
 $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
 $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
 $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
 $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
ck <- compareCluster(geneCluster = gcSample, fun = enrichKEGG)
ck <- setReadable(ck, OrgDb = org.Hs.eg.db, keyType="ENTREZID")

Theme comparison

Questions?

References

1.
Yu G. Biomedical Knowledge Mining Using GOSemSim and clusterProfiler. Accessed January 25, 2023. https://yulab-smu.top/biomedical-knowledge-mining-book/index.html
2.
Gene Ontology overview. Accessed January 25, 2023. http://geneontology.org/docs/ontology-documentation/
3.
Kanehisa M, Furumichi M, Sato Y, Kawashima M, Ishiguro-Watanabe M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Research. 2023;51(D1):D587-D592. doi:10.1093/nar/gkac963
4.
Gillespie M, Jassal B, Stephan R, et al. The reactome pathway knowledgebase 2022. Nucleic Acids Research. 2022;50(D1):D687-D692. doi:10.1093/nar/gkab1028
5.
Boyle EI, Weng S, Gollub J, et al. GO::TermFinder—open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics. 2004;20(18):3710-3715. doi:10.1093/bioinformatics/bth456
6.
Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102(43):15545-15550. doi:10.1073/pnas.0506580102